The files has been split into Task1 and one for Task 2, 3 and 4. This has been done after we recieved an email telling us we needed this to deliver. The Full project is the unedited version.
If you prefer to use the GitHub hosted files, then use these links
rm(list=ls()) #cleaning our environment# Load required librarieslibrary(tidyverse)library(haven)library(curl)library(utils)library(janitor)library(glue)library(leaflet)library(knitr)library(plotly)library(readr)# Code to be used to see what type of category we get to work withcategory <-c("Analgesics","Bath Soap","Beer","Bottled Juices","Cereals","Cheeses","Cigarettes","Cookies","Crackers","Canned Soup","Dish Detergent","Front-end-candies","Frozen Dinners","Frozen Entrees","Frozen Juices","Fabric Softeners","Grooming Products","Laundry Detergents","Oatmeal","Paper Towels","Soft Drinks","Shampoos","Snack Crackers","Soaps","Toothbrushes","Canned Tuna","Toothpastes","Bathroom Tissues")letter2number <-function(x) {utf8ToInt(x) -utf8ToInt("A") + 1L}seed_number <-sum(letter2number("Daniel")) +sum(letter2number("Daniel"))set.seed(seed_number)cat(glue("Our seed number is {seed_number} so our category is {sample(category, 1)}")) #Making a print function to display nicer in html
Our seed number is 410 so our category is Shampoos
Code
# We free up ram by removing old dataframes we dont need to keep loaded.rm(category, letter2number, seed_number)
We start by loading the packages we will use to do the tasks. Then we run the code that ensures we know which product category we are given with our “seed” which is Shampoo.
Task 2
We have chosen the specific brands because some of the brands left the marked and some came into the marked over the years and 1994 was the most consistent for the brands we had chosen.
We run a code that groups all the stores and sums across all their customers, then we filter out only the outlet that has the most customers, this is Dominicks 98 which we will look further into and store it in a separate dataset.
Code
tryCatch({df <-read_csv("sok-1005_data.csv")}, error =function(err) {df <<-read_csv("https://raw.githubusercontent.com/Danieljoha/Sok-1005-Data_science_project/main/sok-1005_data.csv")})df <- df %>%pivot_longer(cols =-c(date, store, week, move, qty, price, profit, sales, custcoun, haba, brand, name, city, zip, lat ,long),# Specifing the columns not to pivotnames_to ="variable_name",# Set the new column name for the valuesvalues_to ="dflong_value")#lager demo_stata igjendemo_stata <- df %>%ungroup() %>%select(-date, -brand) %>%group_by(store, name, city, zip, lat, long, variable_name, dflong_value) %>%summarise(across(1:8, \(x) sum(x, na.rm =FALSE)), .groups ="keep") %>%ungroup() %>%select(-week)df <- df %>%filter(variable_name %in%c("% Population under age 9", NA)) %>%select(-name, -city, -zip, -lat, -long, -variable_name, -dflong_value)
Code
# Filtering for the most relevant store to plotdominicks_98 <- df %>%group_by(store) %>%summarise(total_custcoun =sum(custcoun)) %>%filter(total_custcoun ==max(total_custcoun)) %>%left_join(df, by ="store") %>%select(-total_custcoun)# Pivoting to use facet wrap to show more plotsdominicks_98_long <- dominicks_98 %>%pivot_longer(cols =c(sales, haba,custcoun, profit), names_to ="names", values_to ="values")
Code
# Starting the plotdominicks_98_long %>%ggplot(aes(x = date, y = values, fill = names)) +geom_col() +# using col for bar charttheme_minimal()+# themelabs(x ="", y ="Verdier", # changing some namesfill ="names", labels="", title="Dominicks 98") +scale_fill_manual(values =c("custcoun"="#276DB6", # Changing colors manually"haba"="#B627A5","profit"="cornflowerblue","sales"="purple"))+# facet for 4 different figures in same plotfacet_wrap(~ names, scales="free", labeller =labeller(names =c("custcoun"="Customer count", "haba"="Total cosmetic sales", "profit"="Profit", "sales"="Sales")))+# Setting some adjustments in the theme for positions of title and legendtheme(legend.position ="none", axis.text.x =element_text(angle =45, hjust =1),plot.title =element_text(hjust =0.4, size=20), strip.text = (element_text(size =17)))+# Fixing the x-axis to show monthsscale_x_date(date_breaks ="1 month", date_labels ="%b")
Here you see weekly customer counts for Dominicks 98 which has the most customers of all the Dominicks stores in Chicago. Dominicks 98 was closed in October 1994 so we don’t have data for this month but it was still the best performing store. Further, we can see total cosmetic sales, which is the weekly turnover for shampoo in this store. Gross profit is the difference in profit between purchase price and selling price. Total cosmetic sales and gross profit are in dollars. Sales is a variable that we created via the recipe in the manual, according to the manual this shows us total dollar sales including things that are in bundles.
Code
# We ended up not using this figure, if you wondered why there is a code-fold here#| column: pagebrand_colors <-c("Bold Hold"="#1f77b4","Head and Shoulders"="#ff7f0e","Ivory"="#2ca02c","Spirit"="#d62728","Vo5"="#9467bd")old_fig <- dominicks_98 %>%mutate(haba = haba /1000,brand =case_when( brand =="bold_hold"~"Bold Hold", brand =="head_and_shoulders"~"Head and Shoulders", brand =="ivory"~"Ivory", brand =="spirit"~"Spirit", brand =="vo5"~"Vo5" )) %>%plot_ly() %>%add_trace(x =~date, y =~haba, type ='bar', split =~brand, color =~brand, colors = brand_colors,text =~brand, hovertemplate ="Brand: %{text}<br>Value: %{y}<extra></extra>") %>%layout(title =list(text ="Dominicks 98", xref ="paper", x =0.4),xaxis =list(title ="", tickformat ="%b", dtick ="M1"),yaxis =list(title ="Verdier i dollar(1000)"),legend =list(orientation ="h", x =0.5, y =-0.1, xanchor ="center"),barmode ='stack',annotations =list(list(text ="Weekly sales for shampoo", xref ="paper", yref ="paper", x =0.4, y =0.99, showarrow = F, font =list(size =12)) )) %>%config(displayModeBar =FALSE)#old_fig
Code
# Defining colors for each branddominicks_98_colors <- dominicks_98 %>%mutate(brand =case_when( brand =="bold_hold"~"Bold Hold", brand =="head_and_shoulders"~"Head and Shoulders", brand =="ivory"~"Ivory", brand =="spirit"~"Spirit", brand =="vo5"~"Vo5") ) %>%mutate(color = brand_colors[brand]) %>%mutate(week =as.integer(format(date, "%W")))dominicks_98_plot <- dominicks_98_colors %>%split(.$brand) %>%map(~ggplot(data = .x, aes(x = week, y = haba /1000, fill = brand,text =paste0("Week: ", week, "<br>","Brand: ", brand, "<br>","Health and beauty sales in thousands: <br>$", haba /1000))) +geom_col() +scale_fill_manual(values =unique(.x$color)) +theme_minimal() +labs(x ="Week", y ="Dollar sales for each brand (1000)") +scale_x_continuous(expand =c(0, 0), breaks =seq(1, 52, by =1), labels =seq(1, 52, by =1)) +theme(axis.text.x =element_text(angle =45, hjust =1, size =7), plot.title =element_text(hjust =0.4)) +ggtitle(paste(unique(.x$brand))) )# Convert ggplot objects to Plotly plotsdominicks_98_plotly <- dominicks_98_plot %>%map(~ggplotly(.x, tooltip ="text"))# Print each modified plotdominicks_98_plotly[[1]]dominicks_98_plotly[[2]]dominicks_98_plotly[[3]]dominicks_98_plotly[[4]]dominicks_98_plotly[[5]]
Code
brand_list <-c("Bold Hold", "Head and Shoulders", "Ivory", "Spirit", "Vo5")brand_map <-c("bold_hold"="Bold Hold", "head_and_shoulders"="Head and Shoulders","ivory"="Ivory","spirit"="Spirit","vo5"="Vo5")for (brand in brand_list) { original_brand_name <-names(brand_map)[brand_map == brand] # get the original brand name brand_data <- dominicks_98 %>%mutate(haba = haba /1000,brand =case_when( brand == original_brand_name ~ brand )) %>%filter(brand == brand)plot_ly(brand_data) %>%add_trace(x =~date, y =~haba, type ='bar', color =~brand, colors = brand_colors[brand],text =~brand, hovertemplate =paste("Brand: ", brand, "<br>Value: %{y}<extra></extra>")) %>%layout(title =list(text =paste("Dominicks 98 - ", brand), xref ="paper", x =0.4),xaxis =list(title ="", tickformat ="%b", dtick ="M1"),yaxis =list(title ="Verdier i dollar(1000)"),legend =list(orientation ="h", x =0.5, y =-0.1, xanchor ="center"),annotations =list(list(text =paste("Weekly sales for", brand), xref ="paper", yref ="paper", x =0.4, y =0.99, showarrow = F, font =list(size =12)) )) %>%config(displayModeBar =FALSE)}
In figure 2, we look at which brand sells best among the shampoos we have selected with a lot of movement. You can see that Vo5 clearly has the largest sales volume measured in dollars, and that Head and Shoulders is not as big as they are today.
Code
# List of variables of interestvariables2 <-list(c('% of population that is white','% of population that is neither black & hispanics or white', '% Blacks & Hispanics'),c("% of people that arent college graduates", "% College Graduates"),c("% of population with income under $15,000", "% of population with income over $15,000"),c("% of households without mortages", "% of households with mortgages"))# List to store the resulting datasetsdatasets <-list()# Loop through the variables and create the corresponding datasetsfor (i inseq_along(variables2)) { datasets[[i]] <- demo_stata %>%filter(store =="98") %>%unique() %>%filter(variable_name %in% variables2[[i]])}
In figure 3, the group management can see which customer group shops the most at Dominicks 98, those who contribute the most to sales in this particular store are 81% white, 1.2% other, and 17.8% black or Latin American. Further in figures 4, 5 and 6, we look at whether the customers are students, above the poverty line, and whether the customers have mortgages.
Code
variable_names <-list(c('% of population that is white','% of population that is neither black & hispanics or white', '% Blacks & Hispanics'),c("% of people that arent college graduates", "% College Graduates"),c("% of population with income under $15,000", "% of population with income over $15,000"),c("% of households without mortages", "% of households with mortgages"))create_pie_chart <-function(dataset, variable_names) { pie_data <- dataset %>%filter(variable_name %in% variable_names) %>%mutate(display_value =round(dflong_value *100, 2),custom_text =paste0(variable_name, ": ", display_value, "%")) colors <-c('blueviolet', '#B627A5', 'cornflowerblue') pie_chart <-plot_ly(pie_data, labels =~variable_name, values =~display_value, type ='pie', textinfo ='percent', marker =list(colors = colors), text =~custom_text, hovertemplate ="%{label}: <br>%{value}%<extra></extra>") %>%layout(title ="", legend =list(orientation ="h", x =0.5, y =-0.3, xanchor ="center")) %>%config(displayModeBar =FALSE)return(pie_chart)}plots <-pmap(list(datasets, variable_names), create_pie_chart)
Code
plots[[1]] plots[[2]] plots[[3]] plots[[4]]
Task 3
It is important to note that because it is weekly data we have redefined our week 225 to start at 1994. Because we are using monthly data now it shouldn’t make a large difference but the data now includes the final 4 days from 1993 and doesnt include the last 4 days of 1994. This means January had a few days of data from December 1993, February has a few days of January and so on.
First, we will present monthly values for all Dominick’s stores from the entire chain. We can see these in the figure below.
Code
df_monthly %>%pivot_longer(cols =c(sales, haba,custcoun, profit), names_to ="names", values_to ="values") %>%ggplot(aes(x = month, y = values, fill = names)) +geom_col() +# using col for bar charttheme_minimal()+# themelabs(x ="", y ="Verdier", # changing some namesfill ="names", labels="", title="All Dominicks stores. Monthly values") +scale_fill_manual(values =c("custcoun"="#276DB6", # Changing colors manually"haba"="#B627A5","profit"="cornflowerblue","sales"="purple"))+# facet for 4 different figures in same plotfacet_wrap(~ names, scales="free", labeller =labeller(names =c("custcoun"="Customer count", "haba"="Total cosmetic sales", "profit"="Profit", "sales"="Sales")))+# Setting some adjustments in the theme for positions of title and legendtheme(legend.position ="none", axis.text.x =element_text(angle =45, hjust =1),plot.title =element_text(hjust =0.4, size=20),strip.text =element_text(size =15))+scale_x_continuous(expand=c(0,0), n.breaks =12)
Next, we look at which brands are doing well. As in the previous task, we can see that Vo5 is selling the best across all of Chicago. NB: The y-axis has different scales on the different plots.
Code
# Define colors for each brandbrand_colors <-c("Bold Hold"="#1f77b4","Head and Shoulders"="#ff7f0e","Ivory"="#2ca02c","Spirit"="#d62728","Vo5"="#9467bd")df_monthly_with_colors <- df_monthly %>%mutate(brand =case_when( brand =="bold_hold"~"Bold Hold", brand =="head_and_shoulders"~"Head and Shoulders", brand =="ivory"~"Ivory", brand =="spirit"~"Spirit", brand =="vo5"~"Vo5")) %>%# Add a color column based on brandmutate(color = brand_colors[brand])plots_monthly <-df_monthly_with_colors %>%split(.$brand) %>%map(~ggplot(data = .x, aes(x = month, y = haba/1000, fill = color)) +geom_col() +scale_fill_identity() +theme_minimal() +scale_x_continuous(expand =c(0,0), n.breaks =12) +labs(x="",y="Dollar sales for each brand(1000)")+theme(plot.title =element_text(hjust =0.4))+ggtitle(paste(unique(.x$brand))))# Print each plotinvisible(map(plots_monthly, print))
The data we have can be used to find new and suitable retail outlets. What we can do, for example, is to look at the best retail outlets we already have over time in aggregate, then take a regression analysis of, for example, the 5 or 10 best places and see if the customer flow has changed positively over time. If we then manage to find a location that is popular and has an increasing customer group, we can see if there are any locations nearby where we can set up a new store to cater to the increasing customer flow.
Code
#making an empty listmodel_list <-list()#looping to add to the lsitfor (store_id inunique(df$store)) {#Subset the data for the current store store_data <-subset(df, store == store_id)#fitting the model model <-lm(custcoun ~ date, data = store_data)#adding the model and coefficients into the list model_list[[as.character(store_id)]] <-list(model = model, coef =coef(model))}#Getting the coefficients and r-squared values for all storescoef_df <-data.frame(do.call(rbind, lapply(model_list, function(x) x$coef)))
Shows top 10 stores with increasing customer trend.
# Plotting the most promising storedf %>%filter(store ==8) %>%ggplot(aes(x=date,y=custcoun)) +geom_point()+geom_smooth(method=lm, se=FALSE)+labs(x="", y="Customers", title="Dominicks 8 customers regression")+theme_minimal()+scale_x_date(date_breaks ="1 month", date_labels ="%b")
The results from the regression analysis show that store 8 has the most increasing customer flow. If we look at the map above, this location has good coverage of Dominicks stores. What we should be able to see is what this store does better than the others we have in the area already, and if a new store is to be set up, it should be further east. Otherwise, when we look at the map above, the location that could have more stores is West Chicago, for example with Winfield.
Dominicks 74 is the store with the most declining customers.
Code
table <- coef_df %>%arrange(-date) %>%rownames_to_column(var ="store") %>%mutate(store =as.numeric(store))palette <-colorNumeric(palette =c("red", "green"), domain = loc_data$sales)loc_data %>%mutate(lat = lat/10000, long = long/10000*-1) %>%leaflet() %>%addTiles(options =tileOptions(minZoom=9, maxZoom=13)) %>%addCircleMarkers(clusterOptions =markerClusterOptions(maxClusterRadius =20),lat =~lat, lng =~long, #locations for markersradius =12, #circle sizecolor =~palette(sales), #color of circle based on salesfillOpacity =0.5,stroke =FALSE, #no outlinelabel =~as.character(store), #store labelslabelOptions =labelOptions(noHide =TRUE, direction ="center", textOnly =TRUE, fontSize =16, fontWeight ="bold"), popup =paste0("<strong>Store:</strong>", loc_data$store, "<br><strong>Name:</strong>", loc_data$name, "<br><strong>City:</strong>", loc_data$city, "<br><strong>Zip:</strong>", loc_data$zip,"<br><strong>Profit from shampoo sales:</strong> $", loc_data$profit,"<br><strong>Total shampoo sales:</strong> $", loc_data$sales,"<br><strong>Customer count</strong>(thousands): ", round(loc_data$custcoun/1000),"<br><strong>Health and beauty sales</strong>(thousands): ", round(loc_data$haba/1000))) %>%addLegend(pal = palette, values =~sales, title ="Shampoo total<br> dollar sales ($)")
On this interactive map, we can finally see that it is primarily the central Chicago stores that are doing the best and those on the outskirts that are doing the worst. Our conclusion is that it is in those areas in the middle of Chicago with the least coverage where it will be most profitable to set up a new store or multiple new stores. And by running a code that we had help from ChatGPT to see there might be some form of correlation. However for the correlation numbers it is important to note that statsistical significance is important when interpreting correlations, the correlation may be not zero but if its not statistically significant then it may not be meaningful to look at. We have not used any form of statistical test as we lack the knowledge to do this in a meaningful way but we felt it may be relevant to look at.
Code
correlations <- demo_stata %>%pivot_wider(names_from = variable_name, values_from = dflong_value) #This code is with the use of ChatGPT# List of demographic columnsdemographic_cols <-colnames(correlations)[5:6]# Initialize an empty vector to store correlation resultscorrelation_results <-c()#This code is with the use of ChatGPT# Calculate correlation for each demographic variablefor (col in demographic_cols) { correlation <-cor(correlations$sales, correlations[[col]], use ="pairwise.complete.obs") correlation_results <-c(correlation_results, correlation)}# Combine demographic column names and correlation results into a dataframecorrelation_df <-data.frame(Demographic = demographic_cols, Correlation = correlation_results)#This code is with the use of ChatGPT# Print the correlation dataframecorrelation_df
Demographic
Correlation
lat
0.1445141
long
-0.1655860
This shows us that there is maybe a good idea to have stores with a higher value of latitude which is to the north, and a lower longtitude which is to the east so having your store in the north east chigago area might be something to look into.
Code
correlations <- demo_stata %>%pivot_wider(names_from = variable_name, values_from = dflong_value) #This code is with the use of ChatGPT# List of demographic columnsdemographic_cols <-colnames(correlations)[14:ncol(correlations)]# List of variables for which you want to calculate correlationsvariables <-c("sales", "profit", "custcount", "haba")# Initialize an empty dataframe to store correlation resultscorrelation_df <-data.frame()#This code is with the use of ChatGPT# Calculate correlation for each demographic variablefor (var in variables) { correlation_results <-c()for (col in demographic_cols) {# Ensure both columns are numeric before calculating correlationif (is.numeric(correlations[[var]]) &is.numeric(correlations[[col]])) { correlation <-cor(correlations[[var]], correlations[[col]], use ="pairwise.complete.obs") correlation_results <-c(correlation_results, correlation) } else { correlation_results <-c(correlation_results, NA) } } temp_df <-data.frame(Demographic = demographic_cols, Correlation = correlation_results, Variable = var) correlation_df <-rbind(correlation_df, temp_df)}# Subset the dataframe to keep only rows with correlation greater than 0.1 or less than -0.1. #This code is with the use of ChatGPTcorrelation_df <-subset(correlation_df, abs(Correlation) >0.1)#This code is with the use of ChatGPT#rownames(correlation_df) <- NULL# Print the correlation dataframecorrelation_df
Demographic
Correlation
Variable
5
% Working Women with full-time jobs
-0.1284505
sales
8
% of Retired
0.1091721
sales
10
% of Unemployed
0.1189193
sales
15
% of households with mortgages
-0.2543562
sales
16
% of households without mortages
0.2543562
sales
21
% of population that is neither black & hispanics or white
0.1850833
sales
26
% of women that arent working
0.1284505
sales
27
% of working women with children
-0.2062318
sales
28
% of working women with children 6 - 17
-0.2004089
sales
31
% Blacks & Hispanics
-0.2206697
profit
32
% College Graduates
-0.1124497
profit
45
% of households with mortgages
-0.1341649
profit
46
% of households without mortages
0.1341649
profit
47
% of non-working women with children
0.1022449
profit
49
% of people that arent college graduates
0.1124497
profit
52
% of population that is non-white
-0.2049068
profit
53
% of population that is white
0.2049068
profit
93
% Population over age 60
-0.1037599
haba
94
% Population under age 9
0.1508498
haba
100
% of Unemployed
0.2141608
haba
102
% of households with 2 persons
-0.1555487
haba
The variable “haba” means health and beauty sales.
What we can see here is that there might be some signs of there being some correlation with demographic data but we can not be sure. Further analysis is needed.
References
The data used for this assignment is gathered at the following links
We have also used the resources from the courseplan in SOK-1005 and used the lectures given by Dejene Gizaw Kidane to help us create the code to wrangle the data so there may be similarities at places in Task 1. However any and all errors are ours and ours alone.
Some of the code used to create some of the figures and datawrangling has been taken from our delivered assignments. The candidat numbers are added in the top of the document and we have delivered the assignments in Canvas with links to our Github pages. However for the purpose of anonymity of the authors of this task, our github pages contain our names so this should only first be checked after grading is final.
Appendix for AI usage
Task 1
In the beginning of Task 1 there has been use for ChatGPT to help define the start of “week 1” as we were unsure of how to convert the weekly data to date but it gave us the idea which we used to create the code. Codeline 136 and 146
We used ChatGPT to enter in the variable names after posting the variable names from the PDF. Codelines 157-67
In the code line 251 we had help from ChatGPT to write the REGEX code to help us pick out the brands as the brands had weird names with changing use of letters and such.
Finally we have used ChatGPT to help us translate the documentation we had made for Task1 to english by promting it with “translate to english” and adding our text. This was done to help grade the task.
Task 2
Helped make a trycatch code so if file isnt found, it is gotten from github where it is hosted
ChatGPT has been used to help create a better structure in the text we have made. This has been done by promting the model with our text and asking it to improve it. With some translations
Task 3
Minimal usage
Again used to help create better structure in the text and at times translate our norwegian comments.
Task 4
The use of ChatGPT is shown on this link and we translated one section to english in task 3.
The link shows us promting ChatGPT saying we have a dataframe in R and explaining a bit of the dataframe. The AI then gives us some code and i then specify we want specific columns to be checked. It then gives the code that is used at the end of Task 4 and it has been changed minimally.
Lastly it shows that i asked it to translate a section of Task 3 to to english which was done to help grade this task.